AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Neural Information Processing Systems Mar-20-2026, 00:18:17 GMT

ClevrSkills: Compositional Language And Visual Reasoning in Robotics

Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table, a robot must employ low-level capabilities: moving the effectors to the objects on the table, picking them up, and then moving them off the table one by one, while re-evaluating the changing scene in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high-level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught so?

artificial intelligence, name change, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Neural Information Processing Systems Dec-24-2025, 22:08:16 GMT

Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos

The analysis and use of egocentric videos for robotics tasks is made challenging by occlusion and the visual mismatch between the human hand and a robot end-effector. Past work views the human hand as a nuisance and removes it from the scene. However, the hand also provides a valuable signal for learning. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment.

agent-environment factorization, egocentric video, name change, (8 more...)

Technology:

Information Technology > Artificial Intelligence > Robots (0.63)
Information Technology > Artificial Intelligence > Vision (0.47)

Neural Information Processing Systems Dec-24-2025, 20:27:31 GMT

Trust Region-Based Safe Distributional Reinforcement Learning for Multiple Constraints

In safety-critical robotic tasks, potential failures must be reduced, and multiple constraints must be met, such as avoiding collisions, limiting energy consumption, and maintaining balance. Thus, applying safe reinforcement learning (RL) in such robotic tasks requires handling multiple constraints and using risk-averse constraints rather than risk-neutral ones. To this end, we propose a trust region-based safe RL algorithm for multiple constraints called safe distributional actor-critic (SDAC). Our main contributions are as follows: 1) introducing a gradient integration method to manage infeasibility issues in multi-constrained problems, ensuring theoretical convergence, and 2) developing a TD($\lambda$) target distribution to estimate risk-averse constraints with low bias.

constraint, name change, region-based safe distributional reinforcement learning, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)

arXiv.org Artificial Intelligence Oct-15-2025

Fast Visuomotor Policy for Robotic Manipulation

Jia, Jingkai, Yang, Tong, Chen, Xueyao, Liu, Chenhuan, Zhang, Wenqiang

We present a fast and effective policy framework for robotic manipulation, named Energy Policy, designed for high-frequency robotic tasks and resource-constrained systems. Unlike existing robotic policies, Energy Policy natively predicts multimodal actions in a single forward pass, enabling high-precision manipulation at high speed. The framework is built upon two core components. First, we adopt the energy score as the learning objective to facilitate multimodal action modeling. Second, we introduce an energy MLP to implement the proposed objective while keeping the architecture simple and efficient. We conduct comprehensive experiments in both simulated environments and real-world robotic tasks to evaluate the effectiveness of Energy Policy. The results show that Energy Policy matches or surpasses the performance of state-of-the-art manipulation methods while significantly reducing computational overhead. Notably, on the MimicGen benchmark, Energy Policy achieves superior performance at a faster inference speed compared to existing approaches.
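The energy score mentioned in the abstract is a standard proper scoring rule for sample-based predictions. The following is a minimal sketch of its usual sample estimate, not the paper's implementation: the function name `energy_score`, the plain-tuple action representation, and the use of the Euclidean norm with exponent 1 are assumptions made for illustration.

```python
import math
from itertools import combinations


def energy_score(samples, target):
    """Sample-based energy score; lower is better.

    samples: sequence of m >= 2 predicted actions (equal-length tuples).
    target:  the reference action (e.g. from a demonstration).
    Hypothetical helper for illustration only.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples for the pairwise term")
    # First term: mean distance from each predicted sample to the target.
    fidelity = sum(math.dist(s, target) for s in samples) / m
    # Second term: mean distance between distinct pairs of samples.
    num_pairs = m * (m - 1) / 2
    spread = sum(math.dist(a, b) for a, b in combinations(samples, 2)) / num_pairs
    # Energy score: E||X - y|| - 0.5 * E||X - X'||.
    return fidelity - 0.5 * spread


if __name__ == "__main__":
    predicted = [(0.0, 0.0), (2.0, 0.0)]
    print(energy_score(predicted, (0.0, 0.0)))
```

The first term pulls samples toward the demonstrated action, while the subtracted pairwise term keeps them from collapsing to a single point, which is why a score of this form suits multimodal action distributions.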

artificial intelligence, arxiv preprint arxiv, baseline, (14 more...)

2510.12483

Genre: Research Report > New Finding (0.48)

Industry: Energy (1.00)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

arXiv.org Artificial Intelligence Oct-7-2025

ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

Jang, Huiwon, Yu, Sihyun, Kwon, Heeseung, Jeon, Hojin, Seo, Younggyo, Shin, Jinwoo

Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action models (VLA), i.e., policy models built upon a Vision-Language Model (VLM), more effectively utilize multi-frame observations for action generation. This suggests that VLMs' inherent temporal understanding capability enables them to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training but with reduced training and inference times. Many robotic tasks are inherently non-Markovian, i.e., the optimal decision at a given timestep $t$ cannot be determined from the latest observation $o_t$ alone. For instance, an object may become occluded during manipulation (Shi et al., 2025). Solving long-horizon tasks may also require context about the previous motions of a robot, and handling dynamic environments often involves tracking the motion trajectories of moving objects (Zhang et al., 2025; Nasiriany et al., 2024).

artificial intelligence, arxiv preprint arxiv, contextvla, (13 more...)

2510.04246

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Neural Information Processing Systems Oct-2-2025, 18:54:10 GMT

We thank the reviewers for their constructive comments and helpful feedback

These two preconditions are commonly satisfied in many continuous robotics tasks with parameterized policy classes.

artificial intelligence, constructive comment and helpful feedback, reviewer, (14 more...)

Technology: Information Technology > Artificial Intelligence (0.54)

arXiv.org Artificial Intelligence Sep-18-2025

Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation

Chen, Zidong, Guo, Zihao, Wang, Peng, Egbe, ThankGod Itua, Lyu, Yan, Qian, Chenghao

Flow matching has emerged as a competitive framework for learning high-quality generative policies in robotics; however, we find that generalisation arises and saturates early along the flow trajectory, in accordance with recent findings in the literature. We further observe that increasing the number of Euler integration steps during inference counter-intuitively and universally degrades policy performance. We attribute this to (i) additional, uniformly spaced integration steps oversampling the late-time region, thereby constraining actions towards the training trajectories and reducing generalisation; and (ii) the learned velocity field becoming non-Lipschitz as integration time approaches 1, causing instability. To address these issues, we propose a novel policy that utilises non-uniform time scheduling (e.g., U-shaped) during training, which emphasises both early and late temporal stages to regularise policy training, and a dense-jump integration schedule at inference, which replaces the multi-step integration beyond a jump point with a single integration step, to avoid unstable areas around 1. Essentially, our policy is an efficient one-step learner that still pushes forward performance through multi-step integration, yielding up to 23.7% performance gains over state-of-the-art baselines across diverse robotic tasks.
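The dense-jump inference schedule described in the abstract can be illustrated with a plain Euler integrator over a custom time grid. This is a rough sketch under stated assumptions rather than the authors' code: the names `dense_jump_times` and `euler_integrate`, the uniform spacing of the dense steps, and the example values for `num_dense_steps` and `jump_point` are all hypothetical.

```python
def dense_jump_times(num_dense_steps, jump_point):
    """Time grid: uniform steps on [0, jump_point], then one jump to t = 1.

    Both arguments are illustrative assumptions, not values from the paper.
    """
    if num_dense_steps < 1 or not (0.0 < jump_point < 1.0):
        raise ValueError("need num_dense_steps >= 1 and 0 < jump_point < 1")
    times = [jump_point * i / num_dense_steps for i in range(num_dense_steps + 1)]
    times.append(1.0)  # single step covering the late-time region
    return times


def euler_integrate(velocity, x0, times):
    """Explicit Euler integration of dx/dt = velocity(x, t) along `times`."""
    x = list(x0)
    for t0, t1 in zip(times, times[1:]):
        v = velocity(x, t0)  # velocity evaluated at the start of the step
        x = [xi + (t1 - t0) * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy constant velocity field standing in for a learned network.
    field = lambda x, t: [2.0, -1.0]
    grid = dense_jump_times(num_dense_steps=3, jump_point=0.6)
    print(grid)
    print(euler_integrate(field, [0.0, 0.0], grid))
```

With this grid the velocity field is never queried at times after the jump point, which is one way to read the abstract's goal of avoiding the unstable region near t = 1.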

artificial intelligence, inference, velocity field, (15 more...)

2509.13574

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

arXiv.org Artificial Intelligence Jun-24-2025

DefFusionNet: Learning Multimodal Goal Shapes for Deformable Object Manipulation via a Diffusion-based Probabilistic Model

Thach, Bao, Kim, Siyeon, Jordan, Britton, Shanthi, Mohanraj, Watts, Tanner, Ho, Shing-Hei, Ferguson, James M., Hermans, Tucker, Kuntz, Alan

Deformable object manipulation is critical to many real-world robotic applications, ranging from surgical robotics and soft material handling in manufacturing to household tasks like laundry folding. At the core of this important robotic field is shape servoing, a task focused on controlling deformable objects into desired shapes. The shape servoing formulation requires the specification of a goal shape. However, most prior works in shape servoing rely on impractical goal shape acquisition methods, such as laborious domain-knowledge engineering or manual manipulation. DefGoalNet previously posed the current state-of-the-art solution to this problem, which learns deformable object goal shapes directly from a small number of human demonstrations. However, it significantly struggles in multi-modal settings, where multiple distinct goal shapes can all lead to successful task completion. As a deterministic model, DefGoalNet collapses these possibilities into a single averaged solution, often resulting in an unusable goal. In this paper, we address this problem by developing DefFusionNet, a novel neural network that leverages the diffusion probabilistic model to learn a distribution over all valid goal shapes rather than predicting a single deterministic outcome. This enables the generation of diverse goal shapes and avoids the averaging artifacts. We demonstrate our method's effectiveness on robotic tasks inspired by both manufacturing and surgical applications, both in simulation and on a physical robot. Our work is the first generative model capable of producing a diverse, multi-modal set of deformable object goals for real-world robotic applications.

artificial intelligence, machine learning, point cloud, (19 more...)

2506.18779

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Surgery (0.94)
Health & Medicine > Health Care Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence May-27-2025

RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction

Lu, Weifeng, Ye, Minghao, Ye, Zewei, Tao, Ruihan, Yang, Shuo, Zhao, Bo

Vision-Language-Action (VLA) models have recently advanced robotic manipulation by translating natural-language instructions and image information into sequential control actions. However, these models often underperform in open-world scenarios, as they are predominantly trained on successful expert demonstrations and exhibit a limited capacity for failure recovery. In this work, we present a Robotic Failure Analysis and Correction (RoboFAC) framework to address this issue. First, we construct the RoboFAC dataset, comprising 9,440 erroneous manipulation trajectories and 78,623 QA pairs across 16 diverse tasks and 53 scenes in both simulation and real-world environments. Leveraging our dataset, we develop the RoboFAC model, which is capable of Task Understanding, Failure Analysis and Failure Correction. Experimental results demonstrate that the RoboFAC model outperforms GPT-4o by 34.1% on our evaluation benchmark. Furthermore, we integrate the RoboFAC model into a real-world VLA control pipeline as an external supervisor that provides correction instructions, yielding a 29.1% relative improvement on average on four real-world tasks. The results show that our RoboFAC framework effectively handles robotic failures and assists the VLA model in recovering from failures.

large language model, machine learning, natural language, (19 more...)

2505.12224

Country: Asia > China (0.46)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)